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Abstract 

An estimated 80% of genomic DNA in eukaryotes is packaged as nucleosomes, which, together with the remaining 
interstitial linker regions, generate higher order chromatin structures [1]. Nucleosome sequences isolated from diverse 
organisms exhibit ~10 bp periodic variations in AA, TT and GC dinucleotide frequencies. These sequence elements 
generate intrinsically curved DNA and help establish the histone-DNA interface. We investigated an important unanswered 
question concerning the interplay between chromatin organization and genome evolution: do the DNA sequence 
preferences inherent to the highly conserved histone core exert detectable natural selection on genomic divergence and 
polymorphism? To address this hypothesis, we isolated nucleosomal DNA sequences from Drosophila melanogaster 
embryos and examined the underlying genomic variation within and between species. We found that divergence along the 
D. melanogaster lineage is periodic across nucleosome regions with base changes following preferred nucleotides, 
providing new evidence for systematic evolutionary forces in the generation and maintenance of nucleosome-associated 
dinucleotide periodicities. Further, Single Nucleotide Polymorphism (SNP) frequency spectra show striking periodicities 
across nucleosomal regions, paralleling divergence patterns. Preferred alleles occur at higher frequencies in natural 
populations, consistent with a central role for natural selection. These patterns are stronger for nucleosomes in introns than 
in intergenic regions, suggesting selection is stronger in transcribed regions where nucleosomes undergo more 
displacement, remodeling and functional modification. In addition, we observe a large-scale ( — 1 80 bp) periodic enrichment 
of AA/TT dinucleotides associated with nucleosome occupancy, while GC dinucleotide frequency peaks in linker regions. 
Divergence and polymorphism data also support a role for natural selection in the generation and maintenance of these 
super-nucleosomal patterns. Our results demonstrate that nucleosome-associated sequence periodicities are under 
selective pressure, implying that structural interactions between nucleosomes and DNA sequence shape sequence 
evolution, particularly in introns. 
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Introduction 

Sequence-dependent differences in the physical properties of 
DNA influence its associations with the histone core, as well as the 
kinetics of nucleosome assembly and stability [2-9] . One of the 
most generalizable sequence affinities of the histone octamer is the 
periodic variation of dinucleotide frequencies across nucleosomal 
DNA. Alignments of nucleosomal sequences from diverse eukary- 
otes display a prominent ~ 1 0 bp periodic enrichment of AT-rich 
dinucleotides, along with an anti-correlated periodicity of GC-rich 
dinucleotides [8,10-14]. The ~10bp spacing of AA/TT dinu- 
cleotides generates intrinsically curved DNA molecules with 
increased nucleosome binding affinity [5,8,14-17]. Peaks of AA/ 
TT frequency are found specifically over positions where the 
minor groove bends interiorly, whereas GC dinucleotides peak 
where the major groove is facing the histone core. Structural data 
suggest that DNA shape, in particular the narrowing of the minor 
groove and the associated lowering of its electrostatic potential at 
AT-rich sequences facilitate contacts with key histone arginines 
[9,18,19]. GC dinucleotides contract the major groove, which also 
facilitates the tight winding of DNA around the core [9,20] . 



Although these broadly conserved dinucleotide patterns have 
been cited as evidence for a genomic "code" for nucleosome 
positioning [10], the role of sequence in nucleosome function 
remains contested and unresolved [6,8,21,22]. Correlation 
between in vitro and in vivo nucleosome maps in yeast may 
reflect the influence of the inherent sequence preferences of the 
histone core on nucleosome positioning [7,23]. However, strong 
experimental evidence suggests that trans-acting factors (e.g. RNA 
polymerase II, transcription factors and ATP-dependent remo- 
delers) are central to establishing nucleosome positions along 
genomic DNA (translational positions), particularly in genie 
regions, with sequence providing a weaker contribution 
[7,8,21,24,25]. In cases where DNA sequence does impact 
translational nucleosome positions, its influence is largely attrib- 
uted to GC content and anti-nucleosomal sequences, such as poly- 
dA/dT tracts, rather than dinucleotide patterns [5,8,25-27]. 

Dinucleotides are instead thought to play a distinct but 
integrally connected role in directing and preserving the 
'rotational positioning' of nucleosomal DNA [8,9,20,28], which 
refers to the orientation of DNA relative to the core. Due to the 
structural constraints inherent to nucleosome formation, a given 
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Author Summary 

In eukaryotic cells, the majority of DNA is packaged in 
nucleosomes comprised of ~147 bp of DNA wound 
tightly around the highly conserved histone octamer. 
Nucleosomal DNA from diverse organisms shows an anti- 
correlated ~10 bp periodicity of AT-rich and GC-rich 
dinucleotides. These sequence features influence DNA 
bending and shape, facilitating structural interactions. We 
asked whether natural selection mediated through the 
periodic sequence preferences of nucleosomes shapes the 
evolution of non-protein-coding regions of D. melanoga- 
ster by examining the inter- and intra-species genomic 
variation relative to these fundamental chromatin building 
blocks. The sequence changes across nucleosome-bound 
regions on the melanogaster lineage mirror the observed 
nucleosome dinucleotide periodicities. Importantly, we 
show that the frequencies of polymorphisms in natural 
populations vary across these regions, paralleling diver- 
gence, with higher frequencies of preferred alleles. These 
patterns are most evident for intronic regions and indicate 
that non-protein coding regions are evolving toward 
sequences that facilitate the canonical association with 
the histone core. This result is consistent with the 
hypothesis that interactions between DNA and the core 
have systematic impacts on function that are subject to 
natural selection and are not solely due to mutational bias. 
These ubiquitous interactions with the histone core 
partially account for the evolutionary constraint observed 
in unannotated genomic regions, and may drive broad 
changes in base composition. 

translational position in the genome will assemble with a particular 
rotational alignment. This determines which bases face the 
nucleosome interior and exterior, and also the positioning of the 
major and minor grooves relative to the core. Nucleosomes tend to 
occupy translational genomic positions which are offset by ~ 1 0 bp 
increments [13,28,29]. Thus, due to the helical structure of DNA, 
with — 10.4 bp per turn, the rotational orientation of DNA relative 
to the core is thought to be unchanged as nucleosomes assume new 
favored translational positions (Figure 1A). This 10 bp incremental 
movement leaves the exposure of sites at the surface unchanged 
[20,28], and is in agreement with the reported step size of many 
chromatin remodelers [30]. By influencing the rotational posi- 
tioning of DNA relative to the histone core, nucleotide changes at 
particular nucleosome positions (or in flanking regions) could have 
diverse functional impact, for example on nucleosome assembly, 
stability, remodeling efficiency, RNA and DNA polymerase 
processivities and transcription factor binding site access. Howev- 
er, despite considerable evidence that dinucleotide patterns impact 
nucleosome positioning and dynamics in vitro, in vivo evidence of 
function has remained elusive. 

One approach to discovering function is to look for evidence of 
natural selection in sequence polymorphism (variation within 
species) and divergence (variation between species). Individual 
mutations influencing histone-DNA interactions may have only 
slight, undetectable phenotypic effects in the laboratory; in 
contrast, the associated fitness consequences in large natural 
populations can strongly shape rates of divergence and levels of 
polymorphism over many generations [31]. Of course, strongly 
selected variants will go to fixation quickly and be maintained in 
very high frequency against the weaker force of mutational 
reversion. However, observations of extensive DNA sequence 
polymorphism and divergence throughout the genome, including 
nucleosomal sequences, indicate that such systematic selection is 



not dominating stochastic effects (mutation, genetic drift, and 
variation in selection coefficients) in the evolutionary dynamic. 

Analysis of codon bias suggests that at equilibrium between 
selection, mutation and genetic drift, the ratio of the frequencies of 
two alternative synonymous codons throughout a single genome 
can be used to estimate the direction and magnitude of selection 
[32]. The action of natural selection can be inferred when 
synonymous codon pairs exhibit a strong "bias" towards one state 
relative to the other. This analysis extends to the distribution of 
polymorphic allele frequencies in genomes sampled from natural 
populations [33-35]. A similar approach can be applied to 
alternative nucleotides at particular positions within nucleosomal 
sequences. As the magnitude of selection increases, the expected 
frequency of preferred alleles increases. Consequently, the 
distribution of SNP frequencies (or "site frequency spectrum") at 
a given nucleosome position (analogous to a synonymous SNP) is 
expected to shift towards relatively higher frequencies of preferred 
alleles. 

If the observed dinucleotide patterns reflect selectively favored 
states, ancestrally unpreferred base pairs across nucleosomes 
should diverge towards the "preferred" state along species 
lineages. Further, if the "preferred" divergence patterns reflect 
the average impact of natural selection, then frequencies of 
polymorphisms in natural populations should be more skewed at 
sites experiencing stronger selection. "Unpreferred" variants, 
specifically substitutions or polymorphisms away from favored 
nucleotides, such as substitution of an ancestral A with a G at 
nucleosome positions which are systematically enriched for AA 
dinucleotides, should diverge more slowly and be rarer when 
polymorphic in the population. In contrast, "preferred" variants, 
such as substitution of an ancestral A with a G at positions of 
enriched for the GC dinucleotide, should diverge more rapidly 
and be more common when polymorphic. 

At the lower resolution of an entire nucleosome and its nearby 
flanking regions, both divergence and polymorphism are observed 
to vary [36-39], but evidence of a role for natural selection in the 
underlying evolutionary dynamics remains sparse [40-43]. Studies 
of human SNPs [37,38] and divergence in humans, yeast and 
medaka [36,38,39,43] show that both expected heterozygosity 
and divergence between species are elevated near the central dyad 
and depressed in the adjacent linker regions, though these patterns 
appear to differ by substitutional pathway [38]. One possibility is 
that patterns of variation relative to nucleosomes derive from 
nucleosome-specific mutational biases. This could result from 
suppression of mutation by a protective aspect of nucleosome 
occupancy [44], or it could arise from an interaction between the 
histone core and DNA damage recognition or repair mechanisms 
[45-48]. Of course, natural selection mediated via DNA:nucleo- 
some interactions may also strongly reshape the patterns of SNP 
variation and divergence between taxa [40-43]. Analysis of the site 
frequency spectrum promises to distinguish between these two 
alternatives. 

The whole-nucleosome-resolution analyses considered above 
cannot leverage the specific structural predictions of dinucleotide 
interactions with the core and their strong mechanistic implica- 
tions. Examination of polymorphism and divergence at each base 
pair position across the nucleosomal DNA opens a rich and precise 
view, as well as powerful tests of alternative mechanisms such as 
biased mutation and natural selection. We report the discovery of 
fine-scale periodicities in inter- or intra-species sequence variation 
relative to nucleosomes and discuss their implications for the role 
of natural selection mediated through nucleosome function. Our 
analysis of DNA sequence polymorphism and divergence across 
isolated nucleosomal fragments from D. melanogaster embryos 
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Figure 1. Dinucleotide frequencies in the D. melanogaster genome show small (~10 bp) and large (~180) scale periodicities relative 
to isolated nucleosomal fragments. (A) A cartoon representation of half of the dyad symmetric nucleosome depicting DNA associations with the 
histone core. The dyad axis is marked with a dotted line, and SHL locations are noted in black and red. Colored octagons represent regions where the 
DNA minor groove contacts the histone core. A base in the minor groove is marked for reference (red line and orange dot). The results of a 5 (middle) 
or 10 (right) bp translational movement are depicted. A 5 bp movement rotates the marked base away from the core, moving it to a region where 
the major groove faces the interior. Translational movement of 10 bp preserves the rotational position of the base (interior) and, given a —10 bp 
periodic spacing of preferred dinucleotides, maintains favored sequence positions. (B) Frequencies of AA, TT and GC dinucleotides across intergenic 
(black) and intronic (red) nucleosomal n147 (±50 bp). Note that frequencies are outside the range of the plots near the edges of the n147 because of 
the inherent sequence bias of MNase. (C) Deviations in the observed genomic dinucleotide frequencies relative to expected values based on overall 
nucleotide frequencies in intergenic (black) and intronic (red) regions. (D) Average frequencies of AA/TT (green) and GC (black) dinucleotides 
surrounding intergenic n147 regions (±1 kb). Note that frequencies immediately flanking the edges (±5 bp) of the n 147 are not plotted, since these 
largely reflect sequence bias of the MNase. Read depth of intergenic MNase-derived 142-152 bp (n 142-1 52) nucleosomal DNA fragments in grey (axis 
in blue). 

doi:10.1371/journal.pgen.1004457.g001 



reveals that nucleosomal sequences are diverging towards "pre- 
ferred" nucleotides. Regions where the minor groove is interior 
are becoming more AT-rich, and regions where the major groove 
is interior are becoming more GC-rich along the melanogaster 
lineage. Using a new index for quantitating the frequency 
spectrum (A n ), we identify clear signals associated with natural 
selection, which parallel the observed periodicities in divergence. 
This selection is strongest in intronic regions, where nucleosome 
assembly and positioning are expected to have greater functional 
impacts. These findings support the hypothesis that the widely 
observed sequence affinities of the core octamer have functional 
consequences that are subject to natural selection. Given the 
dominant role of nucleosomes in the packaging of the genome and 
their conserved sequence preferences, their interactions may 
broadly shape the sequence of melanogaster and other genomes. 

Results 

Periodicity in dinucleotide frequencies in D. melanogaster 

To investigate the impact of nucleosomes on DNA sequence 
variation, we isolated nuclei from D. melanogaster embryos, 
performed Micrococcal nuclease (MNase) digestion, and used 
paired-end sequencing to position fragments on the genome 
(Figure SI). Previous studies in Drosophila identified a range of 
periodic dinucleotides in association with nucleosomes [10,11]. 
Our collections of 276,614 intergenic and 270,998 intronic 
autosomal 147 bp nucleosomal fragments (hereafter nl47, Tables 
SI and S2) cover 68.5% of the unique intronic and intergenic 
euchromatic autosomal genome and display a ~ 1 0 bp periodicity 
for many dinucleotide frequencies (Figures 1 and S2). In these and 
subsequent analyses, the 5 '-3' sequence from bases —73 to —1 
were joined to the reverse complement of bases 1 to 73, to reflect 
the dyad symmetry of the nucleosome (see Materials and 
Methods). AA, TT and GC showed the strongest periodicity of 
WW and SS (where W = A|T, S = G|C) dinucleotide pairs, 
respectively (Figure IB). These same dinucleotides show a distinct 
overrepresentation in the non-coding regions of the genome as a 
whole (Figure 1C). As noted in previous studies, AA and TT are 
similarly periodic and occur where the minor groove is interior (at 
superhelix locations, SHL, ±(i+0.5); where i is 0, 1,...6). However, 
noticeable differences between the distributions are apparent. For 
example, the frequency of TT displays a distinctly smaller peak at 
~SHL 4.5, and AA frequency displays a stronger drop at ~SHL 2 
(Figure IB). GC frequency across nl47 regions is anti-correlated 
with AA/TT and is characterized by a prominent upward 
concavity (Figure IB). These dinucleotide periodicities extend 
well beyond nl47 edges into linker regions, consistent with the 
proposed translational step size of 10 bp. 

Upon examination of the dinucleotide frequencies flanking 
aligned nl47 regions, we discovered an additional large-scale 
pattern in AA/TT and GC dinucleotide frequencies (Figure ID). 



This ~ 1 80 bp periodic variation in frequency tracks with overall 
nucleosome "occupancy" in the regions flanking the nl47. 
Average AA/TT frequencies (Figure ID) and overall A/T 
frequencies (Figure S3) are higher in regions of greater nucleosome 
"occupancy" and lower in putative "linker" regions. Thus, the 
AA/TT sequence features that facilitate nucleosome formation are 
enriched over regions with higher nucleosome "occupancy." 
Conversely, GC frequency (and overall G/C frequencies, Figure 
S3) peaks at the periphery of more nucleosome-dense regions and 
in "linker" regions. These surprising "super-nucleosomal" peri- 
odicities extend the observed nl47 patterns to flanking multi- 
nucleosomal arrays, and suggest a contribution of sequence to 
translational positioning. Consistent with chemical mapping of 
nucleosomes, this result suggests that the observed experimental 
correlation between MNase nucleosome "occupancy" and GC 
content [1,8,22,26,49,50] reflects differential recovery, rather than 
positional preference [23,51]. 

Divergence along the melanogaster lineage mirrors 
periodic nucleosomal base preferences 

If variations in dinucleotide frequencies relative to nucleosomes 
result from accumulated sequence divergence, we expect substi- 
tution patterns to parallel the observed base preferences. However, 
the timescale(s) at which these patterns evolve is unknown. 
Lineage-specific or "polarized" divergence is the proportion of 
nucleotide sites that are different in melanogaster while identical in its 
sister taxa simulans (most recent common ancestor 2.5 MYA) and 
the proximate outgroup (yakuba or erecta; 6-7 MYA, see Materials 
and Methods). Overall genomic divergence on the melanogaster 
lineage shows a marked excess of G^A (inferred ancestral G, 
derived A in melanogaster) and C^T (ancestral C, derived T) 
substitutions compared to A-h>G and T^C (Figure 2A). This is in 
agreement with earlier estimates of divergence on the melanogaster 
lineage [52,53] and with the observed two-fold greater mutation 
rate [54]. We next considered the average divergence at each site 
across nl47 regions, normalized for underlying base frequencies. 
This analysis revealed a striking ~ 1 0 bp periodicity in transitions 
(GC— »AT and AT^GC) for two estimates of divergence; per~nl47 
in Figure 2B is weighted by the redundancy in the nl47 set, while 
per-site in Figure S4 weights each site equally. Rates for GC^AT 
and AT^GC are anti-correlated and track with underlying 
dinucleotide frequencies. Thus, ancestral GC bases are more likely 
to become AT in nucleosomal regions where AA/TT dinucleo- 
tides are in higher frequency, and AT bases are more likely to 
become G or C at sites where GC is enriched. GC^AT 
divergence also shows a marked curvature, with a peak at the dyad 
axis. 

Given the substantial variation in individual substitution rates, 
we next examined specific pathways to determine their relative 
contributions. Of all pathways, G^A, C^T, A-^G and T^C 
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Figure 2. Divergence on the melanogaster lineage mirrors nucleosome sequence preferences. (A) Average polarized intergenic and 
intronic heterozygosity (ti) and divergence on the melanogaster lineage for specific substitutions. (B) Smoothed average per-n147 polarized intergenic 
and intronic GC^-AT (green = intergenic, light green = intronic) and AT-^GC (black = intergenic, grey = intronic) divergence across aligned n147 
regions (±50 bp). Intergenic dinucleotide frequencies are represented above. (C) Smoothed average per-n147 polarized divergence across intergenic 
and intronic n147 regions (±50 bp) for specific substitutions. Indicated dinucleotide frequencies plotted above for reference. Note that for both B 
and C, the estimated divergence rates are outside the range of the plots near the edges of the n147. (D) Smoothed intergenic per-n147 G^A 
divergence mapped onto bases +73 to —6 from the nucleosome structure [55]. Arginines that contact the minor groove are color coded by histone 
and highlighted by arrows. (E) Smoothed average combined G^A:C^T polarized divergence surrounding intergenic (black) and intronic (red) n147 
regions (±1 kb). Note that divergences immediately flanking (±5 bp) the edges of the n147 are not plotted, since these largely reflect sequence bias 
of the MNase. Read depth of intronic (light grey) and intergenic (grey) n142-152 is represented below. 
doi:10.1371/journal.pgen.1004457.g002 



exhibited the most obvious periodicities in divergence (Figure 2C; 
see Discussion). In some cases, divergence patterns reflect the 
subtleties observed for dinucleotide frequency patterns. For 
example, C^>T rates are less peaked at SHL 4.5, the location of 



the lowest peak in TT frequency (Figure 2C). C— >T also shows a 
greater difference in rates between the nl47 periphery and linker 
regions (compared to G^A). Interestingly, for each ancestral base, 
the periodicities of substitutions that do not change GC content, 
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appear weaker, perhaps due to both scaling and weaker signal to 
noise (Figure 2C). A subset of non-overlapping nl47 regions 
showed similar patterns (Figure S5A). 

When mapped onto the DNA from the nucleosome structure 
[55], peaks of intergenic G^>A divergence clearly occur within 
regions where the minor groove is interior and in contact with key 
arginines of the histone core (Figure 2D). Note also the higher 
G^A divergence toward the central axis, as reflected by the 
downward concavity in Figure 2C. This is consistent with analyses 
of the impact of sequence variation on nucleosome structure, 
which identified this central region of H3/H4 interactions as most 
constrained [56]. Conversely, A^G substitution rates are highest 
in regions where the major groove is interior (Figure S5B). This 
pattern is consistent with established SS dinucleotide patterns 
[8,10-14] and the observation that GC rich sequences are 
disfavored for minor groove compression and favor narrowing of 
the major groove [9,18]. 

Divergence patterns should also reflect the observed nucleo- 
some-scale periodicities in base and dinucleotide frequencies 
(Figures ID and S3). To increase signal, we combined comple- 
mentary substitutions, G^A and C^>T (G — »A:C— »T) and A-h>G 
and T-h>C (A— »G:T— »C). Aligned nl47 regions show substantially 
lower divergence rates than their immediate flanking sequences 
(Figure 2E). Rates drop to the local background within ~500 bp, 
following the skew of AA/TT dinucleotides (and overall AT 
content; Figures ID and S3). In spite of this local variation in rates, 
due at least in part to MNase preferences (Figure S6), we observe a 
large-scale (180 bp) periodicity in G^A:C^T divergence sur- 
rounding intergenic nl47 nucleosomal regions (Figure 2E). Introns 
showed a similar but weaker pattern, potentially due to the 
influence of flanking coding regions (Figure 2E). Any periodicity of 
the A— »G:T— »C divergence in flanking regions is less obvious 
(Figure S5C), at least partially due to a 50% lower rate of 
divergence and thus inherently weaker signal. 

These large-scale patterns allow us to resolve general trends in 
divergence relative to nucleosome occupancy. We find that, on 
average, G^A:C^T changes along the melanogaster lineage are 
fixed at higher rates across nucleosomes relative to linkers, 
mirroring underlying AA/TT dinucleotide frequencies. This is 
in apparent contrast to the report that the cytosine deamination 
mutational pathway (a major source of G^A:C^T transitions) 
and associated divergence is suppressed by nucleosome occupancy 
[44]. To clarify this discrepancy, we examined the interactions 
between divergence and "occupancy" of the nl47 fragments, as 
estimated by depth of coverage by 142-152-bp nucleosomal 
fragments, nl42-152. Indeed, we observe a negative correlation 
between this metric and all substitutional pathways (Table S3, 
Figure S7A). However, we note that n 142- 152 coverage is 
correlated with GC content of the nl47 region (Figure S8D), as 
previously reported in other studies [1,5,8,26,49], and that 
correlations between nucleosome fragment GC and divergence 
are even more striking (Table S3, Figure S7B, S8C). This is also 
true for 500 bp intergenic windows, independent of nucleosome 
coverage (Table S3). When we parse nl47 by nl42-152 
"occupancy," we observe differences in AA/TT frequency, 
G^A:C^T divergence, and nucleosome phasing in flanking 
regions (Figure S8A). The periodicities of these features are most 
obvious surrounding highly occupied (GC-rich) nl47 regions, 
but they do not appear to be unique to them. Thus, we conclude 
that nucleosome bound regions in D. melanogaster embryos are 
generally more AT-rich and have higher rates of G^A:C^T 
substitution than their adjacent "linker" regions, inconsistent 
with the fundamental claim in Chen, et al. [44] (mentioned 
above). 



Periodicity in the site frequency spectra of natural 
populations: Evidence of natural selection 

The divergence patterns we observe are consistent with known 
nucleosomal dinucleotide preferences [5,8,10-17]. This is analo- 
gous to observations for codons, where substitutions mirror 
genome-wide codon usage biases and are attributed to natural 
selection for preferred codons [34,57,58]. However, divergence 
patterns alone cannot exclude the hypothesis that substitutional 
patterns result from biased mutation relative to nucleosomes. 
Mutation rates may vary across nucleosome-bound regions and 
could lead to compositional variation and different rates of 
divergence. Nevertheless, once a new selectively neutral allele 
arises, its dynamics and thus its distribution of frequencies are 
independent of type (or rate) of mutation [59,60] . While natural 
selection influences the probability of fixation (thus the rate of 
divergence), mild differences in fitness will also shift the site 
frequency spectra of polymorphic alleles [61-63]. Neutral and 
deleterious mutations tend to spend much of their typically short 
lives as rare alleles, while weakly favored alleles will be found at 
higher frequencies as many more drift towards fixation. Although 
the impacts of varying demographic histories [64] and of linked 
selection [48,49] can lead to distributions of selectively neutral 
polymorphisms that mimic particular forms of selection, they 
should do so randomly across the genome and not show a 
positional relationship within nucleosomal sequences. 

The hypothesis that nucleosome structure and function impose 
natural selection on genomic sequence variation predicts period- 
icities in the frequency spectra. Indeed, the average per-nl47 
frequencies of G-A and C-T SNPs in a sample of 36 D. melanogaster 
genomes from Raleigh (North Carolina) exhibit nucleosomal 
patterns paralleling those observed for dinucleotides and polarized 
divergence (Figures 3A and S9). Frequencies of A alleles at G-A 
SNPs show clear periodicity across intergenic and intronic nl47 
regions, extending into linker regions (Figure 3A). A alleles are 
relatively more common in SHL ±(i+0.5) regions, and G alleles 
are higher in regions where the major groove faces the histone 
core. Removal of singleton SNPs (cases where either allele is 
observed only once), which can mitigate the impact of possible 
sequencing errors, raises average A frequencies but does not 
eliminate the periodicity (Figure S9). Partitioning such SNPs by 
ancestral state can remove the impact of average mutation rate 
differences and reveal differences in the patterns of selection. 
Nucleosomal patterns of the average per-nl47 frequencies of 
derived SNPs, such as G^GA (ancestral G and a derived, 
polymorphic A), exhibit clear periodicities that generally parallel 
divergence and nucleosomal dinucleotide frequencies (Figures S10, 
intergenic, and SI 1, intronic). 

To systematically assess the periodicity in the frequency spectra 
we calculated a new index, A K , (closely related to Tajima's D [59]) 
across nl47 regions for the Raleigh sample [65]. Where p is the 
frequency of a SNP in the sample and fc = 2p(l —p) is the estimate 
of the heterozygosity, we define A n as the average (per SNP) 
deviation in n from expectation under equilibrium between 
genetic drift and mutation to selectively equivalent alleles (see 
Materials and Methods). The "folding" of the frequency spectrum 
such that p is equivalent to ( 1 —p) mitigates the impacts of errors in 
the inference of the ancestral state [66] and emphasizes variation 
in the midrange of p. Weak positive selection is predicted to skew n 
toward higher values (more positive A n ), while weak negative 
selection leads to more negative A n . Thus, systematic differences in 
selective forces at different positions across nl47 regions should 
yield a pattern in A K that parallels that observed for divergence. 
These patterns of A n are superimposed on the observed genome- 
wide average negative skew [65] (Table S4) that can be attributed 



PLOS Genetics | www.plosgenetics.org 



6 



July 2014 | Volume 10 | Issue 7 | e1004457 



Nucleosomes Shape DNA Polymorphism and Divergence 




-100 -50 0 50 100 

Distance [bp] 



INTERGENIC 



INTRONIC 




-100 -50 0 50 100 
Distance [bp] 



-100 -50 0 50 100 
Distance [bp] 




-1000 



-500 0 500 

Distance [bp] 



1000 



r-5 



■8 



0) fM 



-1000 -500 0 500 

Distance [bp] 



1000 




1.00 
0.75 
0.50 
0.25 
1 0.00 



A^C 
T^A 



T^G 
G^T 
C^A 



'tya 4th 'tya '^fs- 'tys- 

r/ b„ f s r n r/ b„ r/ bf, . 

&S n .O/ 0 S/ s 0/ c .fife. 0/~ 



smoothed A 



IT 



INTERGENIC 



INTRONIC 




-100 -50 0 50 100 
Distance [bp] 

Non-overlapping A n 
Rwanda smoothed A„ 



100 -50 0 50 100 
Distance [bp] 



PLOS Genetics | www.plosgenetics.org 



7 



July 2014 | Volume 10 | Issue 7 | e1004457 



Nucleosomes Shape DNA Polymorphism and Divergence 



Figure 3. Periodicity in average SNP frequencies aligns with nucleosomal divergence and dinucleotide sequence preferences. (A) 

Smoothed average intergenic (black) and intronic (red) per-n147 frequency of A alleles at G-A polymorphic sites in the Raleigh sample. (B) Average 
smoothed (black) and unsmoothed (grey) per-n147 A„ across n147 regions (±50 bp) for G^GA, C^»CT, and A-^AG polymorphic sites. Color-coded 
dinucleotide frequencies are included for reference. (C) Average smoothed n147 G^GA A„ for the Rwanda sample (orange) and across a non- 
overlapping subset of intergenic and intronic n147 regions (teal). Note that for both B and C the estimated divergence are outside the range of the 
plots near the edges of the n147. AA dinucleotide frequencies included for reference (green). (D) Combined intergenic and intronic (G^A:C^T; 
green) divergence and (G^GA:C^CT; black) A„ surrounding intergenic n147 (±1 kb). n142-152 read depth in grey (axis in blue). Note that 
divergence and A„ immediately flanking (±5 bp) the edges of the n147 are not plotted, since their true values are largely obscured by sequence bias 
of the MNase. (E) Average scaled intergenic and intronic per-n147 n, A„ and divergence for specific substitutions (see Table S4). 
doi:10.1371/journal.pgen.1004457.g003 



to strongly deleterious mutations [67,68], varying demographic 
history [59,64] or linked selection (background selection [69] and 
hitchhiking [70]). 

Indeed, when we examined average A n for G-^GA polymor- 
phisms, we discovered a clear ~10 bp periodic skew in frequency 
across nucleosomal regions, mirroring G— >A divergence (per-nl47 
in Figure 3B and per-site in Figure S12). nl47 G^GA A n are less 
negative in regions of higher AA dinucleotide frequency. 
Interestingly, intronic G-^GA A K shows even more pronounced 
periodicity in the frequency spectrum, including the prominent 
drop at SHL 2 observed for AA frequency (Figures 3B and SI 2). 
A n for C^CT polymorphisms is also periodic in introns (both per- 
nl47 and per-site), with peaks aligning with regions of high TT 
frequency; while intergenic nl47 share a subset of these peaks 
(Figures 3B and SI 2), the overall patterns show much weaker 
periodicity (see below). Although peaks in intronic C^>CT A n 
overlap roughly with those for G^GA sites, they show a more 
convex shape, similar to the C^T divergence. Substitutions in the 
complementary directions (e.g. A-H>AG) also show a periodic skew 
in allele frequencies. Introns display a striking periodicity in 
A-^AG A n aligned with GC frequency (Figure 3B, while 
intergenic nl47 A-hAG sites show only two peripheral A K peaks 
and several peaks (valleys) that are discordant with the GC 
dinucleotide periodicity. Like underlying GC frequency, intronic 
A-^AG A n has a concave upward shape. We observe weaker but 
interesting indications of continued periodicity in linker regions, 
consistent with selection for the preservation of rotational 
positioning in association with translational repositioning. The 
patterns of A n for 5 non-overlapping subsets of n 147 regions were 
similar (Figure SI 3). We conclude that for several substitutional 
pathways there is strong evidence of selection maintaining the 
observed nucleosomal (di)nucleotide preferences. 

The periodicity of nucleosomal is not limited to the Raleigh 
population. The strongest of these periodic patterns in are also 
apparent in a smaller, independent set of 2 1 sequenced genomes 
from a Rwandan (Africa) population [71] (Figures 3C and SI 4), 
which also exhibits more negative average A K values. This African 
sample is assumed to represent a larger, more stable population 
from the center of the species distribution, while the Raleigh 
sample represents the serial diasporas out-of-Africa and into North 
America. Notwithstanding differences in average A n , these strong 
and predicted periodicities in nucleosomal in A^ support our 
hypothesis that direct interactions between the histone core and 
DNA sequence polymorphisms yield functional effects with fitness 
consequences. 

An alternative hypothesis to explain these periodicities holds 
that the sequences evolve independendy of natural selection and 
that the in vivo positions of our isolated nucleosomes reflect the 
innate preferred rotational positions of the particular genome 
used. Derived SNPs detected in a single strain are likely to be in 
high frequency, and thus we might observe periodicity in the 
frequency spectra at such SNPs in the absence of natural selection. 
To test for the impact of this hypothesized ascertainment bias on 
the periodicity of A n , we filtered the nl47 for those in which the 



source genome bore the ancestral alleles. Despite the unavoidable 
thinning of the data, we observed clearly periodic polarized A^ for 
those pathways with the strongest initial signals, e.g. intronic 
G^GA, C^CT and A^AG (Figure SI 5). These results indicate 
that the observed periodicities in the frequencies of preferred bases 
(parallel to the dinucleotide frequencies and the divergence) 
cannot be attributed to biases in the ascertainment associated with 
the genotype from which the nucleosomal sequences were 
prepared. 

We next considered the values of A n surrounding nl47 regions. 
The observed skew in intergenic G^GA:C^»CT A n extends into 
adjacent sequence (Figure 3D), tracking with the periodicity of 
G — >A:C^T divergence. Interestingly, in the ~500 bp flanking 
nl47 regions, there appear to be major and minor A^ peaks 
associated with each divergence peak. Given the shoulder of 
C^»CT A n values in linker regions adjacent to nl47 (Figure 3A), 
this could represent a nucleosomal and a linker peak. Intronic 
regions show higher overall values of G — >GA:C^CT A n and 
similar, but weaker, indications of increased G^GA:C^CT A n 
associated with nucleosome occupancy (Figure 3C). Among other 
interesting patterns in A n and contrasts to divergence in these 
flanking regions are those associated with the complementary set 
of substitutional paths, A^AG:T->TC, which exhibits peaks over 
apparent linker regions in A n but no parallel pattern in 
A-^G:T-^C divergence (Figure SI 6). 

On average (per-nl47), G^GA and C^CT are the most 
common polymorphisms and have among the most positive A n , 
indicating weak positive selection, in addition to being the most 
rapidly diverging bases (Figure 3D). Although rates of A-h>G and 
T^C divergence (and rates of associated polymorphisms) are 
much lower, these types of polymorphic sites also have high 
average A n (Figure 3E). Thus, substitutions with the most periodic 
divergence and A n also show the least overall negative skew in the 
frequency spectrum. Relative relationships of n 147 average K, A„ 
and divergence are quite similar to those of a non-overlapping 
subset and to the genome-wide averages (Table S4 and Figure 
SI 7). These broad genomic patterns appear inconsistent with 
equilibrium models and may reflect heterogeneity and/or recent 
(transient) shifts in selective forces [35,52,72]. 

Discussion 

Histones are among the most ubiquitous and highly conserved 
eukaryotic proteins. Thus, it is not surprising that nucleosomal 
dinucleotide periodicities, which derive from key structural 
interactions between DNA sequence and the histone core, are 
shared widely across species. In spite of the near universality of 
these patterns among eukaryotes and decades of research, our 
understanding of their functional impact and evolutionary 
dynamics remain unsettled. In this work we examined genomic 
variation across regions defined by isolated nucleosomal DNA 
fragments. Our goal was to first determine if these regions showed 
interpretable variation in divergence between species, then to 
analyze population genomic variation for evidence of a role for 
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natural selection in the generation and maintenance of nucleo- 
some-associated sequence variation. 

We find that divergence on the melanogaster lineage mirrors the 
sequence preferences of the histone core. This periodic variation in 
substitution rates across nucleosomal regions indicates that interior 
minor groove regions display more rapid substitution of AT for 
GC, and that AT base pairs in regions where the major groove 
faces inward are more likely to become GC rich. These striking 
patterns align directly with dinucleotide patterns that stabilize 
associations between DNA and the histone core, as documented in 
numerous biochemical and structural studies [5,8,15,18,19,56]. If 
nucleosome-bound regions are evolving toward the observed 
nucleosome sequence preferences, a key question is whether this is 
the result of mutational bias relative to the positioning of 
chromatin proteins, or whether it is the consequence of natural 
selection based on functional differences. The available depth of 
population data and our new index A„ allowed us to directly 
address this question. We find remarkable periodicities in A„ that 
parallel the observed patterns of divergence. The spectra of SNP 
frequencies across n 1 47 regions are variable, with higher A K when 
the inferred ancestral allele is unpreferred, and the derived allele is 
structurally favored. Therefore, we conclude that selection is, at 
least in part, driving the maintenance of nucleosome-associated 
sequence patterns on the melanogaster lineage. 

If the fitness differences associated with such histone:DNA 
interactions are largely arising from nucleosomal dynamics 
(assembly, disassembly, movement and modification) and rota- 
tional positioning of functional elements, then we can further 
hypothesize that transcribed (intronic) nucleosomal sequences 
should exhibit stronger periodicity than untranscribed (intergenic) 
nucleosomal sequences. Consistent with this hypothesis, correla- 
tions of lineage specific divergence and A^ with the relevant 
underlying dinucleotide frequencies are stronger for intronic 
sequences. Table 1 shows that in each case where a large 
difference between intergenic and intronic is apparent, it is the 
intronic that is larger. The two exceptions, G— >A & AA and A-^G 
& GC, are those where both correlations are among the highest. 
As might be expected given the longer timescale and greater 
number of variable sites, divergence correlates more strongly with 
dinucleotide frequencies than A„. Interestingly, Figures 2C and 3B 
show that these intronic vs. intergenic differences in correlation of 
divergence and A n with dinucleotide patterns may be attributable 
to large deviations from expectation in specific regions of the 
nucleosomal sequences, while other regions follow the expected 
periodic patterns. 

While natural selection is the most direct interpretation of these 
results, interactions between chromatin proteins and DNA 
damage and repair are well documented [44-48,73-76]. Contex- 
tually biased mutation (substitution) pathways could underlie the 
observed periodicities in nucleosomal divergence. However, 
Drosophila does not have a significant level of 5-methylcytosine 
[77], the deamination of which is thought to drive the strong 



contextual biases (NpCpG) in vertebrates [78]. Indeed, a recent 
genomic sequencing study of Drosophila mutation accumulation 
lines yielded no evidence for contextual biases [54]. Most 
importandy, such sequence-contextual as well as nucleosome- 
mediated biases in mutation rates are excluded as an explanation 
for the observed periodicities in the skew of the SNP frequency 
spectrum (A n ), since strictly neutral mutations should display the 
same frequency distribution across the genome [31,33,59]. 
Support for a role of natural selection maintaining these 
periodicities is bolstered by the stronger periodicities in intronic 
nucleosomal sequences, where transcription-associated remodeling 
and disruption of nucleosome-DNA interactions are more likely to 
have functional impacts. 

There is, however, one potential "selectively neutral" mecha- 
nism to explain the observed periodic patterns in A K . Biased gene 
conversion (BGC), a process where heteroduplex regions formed 
between homologs are repaired in a direction favoring one base, 
can create SNP dynamics analogous to those of directional 
selection [79] . BGC systematically favoring GC over AT has been 
observed in a few species and indirectly implicated in others by 
associations of local GC content with estimated rates of crossing 
over [80]. However, evidence for such an association is not 
observed in Drosophila [81]. Given that the magnitudes of average 
A n and its periodicities for G — >GA and C^CT SNPs are 
comparable to those of A-h>AG, any explanation of our results 
invoking BGC would have to involve multiple distinct gene 
conversional biases that depend on nucleosome position. While 
this is conceivable and worthy of further investigation, we 
conclude that the canonical GC-biased gene conversion is not a 
significant component of the evolutionary dynamics leading to 
these intricate nucleosomal patterns of polymorphism and 
divergence. 

Whether these periodic patterns are the product of natural 
selection or BGC, the magnitude of the average force shaping the 
dynamics of nucleosomal SNPs must be small compared to that 
affecting the evolution of nonsynonymous variants. The shifts in 
G-h>A divergence between peaks and valleys in the nl47 are 
~0.001 against a background average of ~0.01, suggesting 
relatively weak constraint of 1 in 10 mutations. The non- 
synonymous rate of divergence on the melanogaster lineage, 
~0.006, is about one tenth of that for synonymous divergence 
corresponding to 9 out of 10 mutations being selected against [61]. 
Comparable conclusions could be drawn from the modest 
magnitudes of periodic fluctuation in expected heterozygosities 
and, indeed, in the widely observed periodicities in dinucleotide 
frequencies of nucleosomal sequences. Still, by virtue of its four- 
fold greater genomic footprint, the net selective impact of just the 
selection associated with such nucleosomal periodicities could 
approach the magnitude of non-synonymous variants. As is the 
case for coding sequence, differences in the relative (average) rates 
reflect the aggregate impact of selection that must vary substan- 
tially among nucleosomes, as well as among sites. 



Table 1. Pearsons correlation of lineage specific (per-site) divergence or A n with dinucleotide freq 
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Evidence of natural selection supporting nucleosome-associated 
sequence periodicities and the implication of their biological 
impact casts the potential functions of non-protein-coding regions 
in a new light. Substantial portions of Drosophila, human and 
other genomes appear to be under evolutionary constraint, yet 
lack any functional annotation [67,68]. Further, SNPs identified 
by genome-wide association studies (GWASs) of interesting human 
phenotypes often have mild attributable effects and map to 
unannotated intronic or intergenic regions, where mechanistic 
hypotheses concerning the impacts of such genomic variation are 
lacking. We demonstrate that at least part of the constraint in 
Drosophila arises from interactions between histone proteins and 
DNA sequence. 

Our results suggest dinucleotide periodicities and the rotational 
positioning that they guide have significant biological consequenc- 
es. Sequences affecting rotational positioning can influence the 
binding of transcriptional activators and participate in regulation 
of expression or gene splicing [25,28,82-84]. More generally, they 
impact nucleosome assembly and stability [2-7,9,17], properties 
that broadly impact chromatin dynamics and may influence 
higher order chromatin structures. Further, the observed large- 
scale periodicities in dinucleotide frequencies (and divergence and 
A„ patterns supporting them) demonstrate that sequences that 
facilitate rotational positioning are specifically enriched relative to 
adjacent nucleosomes. So, while periodic sequence patterns are 
considered more relevant to rotational positioning, they clearly 
interact with the translational positioning of arrayed nucleosomes 
in Drosophila. Going forward, deeper and more detailed 
population genomic analyses should provide a unique window 
into the complex in vivo interactions between DNA sequence and 
nucleosome function. 

The significance of these periodic patterns of polymorphism and 
divergence is amplified in light of the substantial proportion of the 
eukaryotic genome packaged in nucleosomes (four-fold greater 
than that of coding sequence in Drosophila) and the broad 
conservation of dinucleotide interactions with the histone core. 
Indeed, no other DNA-protein interaction remotely approaches 
the genomic density or structural impact of nucleosomes. The 
striking periodic variation we observe relative to nucleosomes 
fundamentally changes expectations about divergence and SNP 
frequency, particularly in non-protein-coding regions. Our results 
point to a layer of evolutionary forces across entire genomes, 
emanating from the interactions of DNA sequence variation with 
the structure and function of the histone core. 

Materials and Methods 

Nuclear isolation and MNase digestion 

Embryos were collected from population cages [85] over a 1 hr 
period and aged at 25 °C for 2-3 hr. Staged embryos were 
dechorionated in 50% bleach for 2 minutes, washed extensively, 
and then homogenized on ice in SEC buffer (10 mM HEPES, 
150 mM NaCl, 10 mM EDTA 10% glycerol, 1 mM DTT) with 
Protease Inhibitors (PI) (0. 1 mM PMSF and 2X Roche EDTA- 
free Protease Inhibitor tablets). After lysate filtering and centrifu- 
gation, pelleted nuclei were resuspended in CIB (15 mM Tris 
pH 7.5, 60 mM KC1, 15 mM NaCl. 0.34 M Sucrose, 0.15 mM 
Spermine, 0.5 mM Spermidine)+PI and then repelleted. Centri- 
fugation of nuclei in CIB was repeated 3 times, and the resultant 
pellet was flash frozen and stored at — 80°C. 

Pelleted frozen nuclei were resuspended in CIB+PI, and 
chromatin was digested with 0.5 U/ml Micrococcal nuclease 
(Sigma) for 37°C for 15 min. MNase treated nuclei were pelleted, 
resuspended in 0.1% NP-40 PBS+PI, and incubated at 4°C for 



3 hrs to release (primarily) mononucleosomes. Nuclei were re- 
pelleted, and chromatin from the supernatant was phenol- 
chloroform extracted. Digestion was analyzed on an agarose gel. 

Library construction and sequencing 

Approximately 882 ng of DNA was used as starting material for 
paired-end sequencing library construction following the Illumina 
protocol (PE- 102- 1001). 10 u.1 of paired end adapter oligos were 
ligated to the end-repaired, A-tailed fragments in a 50 ju.1 reaction. 
The adapter-ligation product was gel-purified to select molecules 
approximately 150-700 bp in length and re-suspended in 30 |0.1 
total volume. 1 |J.l of size-selected ligation product was used as 
template for 12 cycles of library enrichment PCR in a 50 ul 
reaction volume. The enriched library was purified using 
QIAGEN MinElute columns and sequenced (2x36 cycles) on 
one lane of a flow cell (FC42JB8) with an Illumina GAIIx running 
the Illumina software SCS v2.4.135/Pipeline vl.4.0. 

Subsequently, 8 ul of the same size-selected ligation product 
was used as a template for 10 cycles of library enrichment PCR 
in a 100 u.1 reaction volume. To enrich for 147 bp fragments, 
the library was purified using QIAGEN MinElute columns, then 
size selected on an agarose gel to recover fragments approxi- 
mately 273 bp in length (as determined by an Agilent 
Bioanalyzer). This size-selected library was sequenced (2x36 
cycles) on four lanes of a flow cell (FC61BGN) with an Illumina 
GAIIx running the Illumina software SCS v2.5.38/Pipeline 
vl.5.0. 

Read mapping and filtering 

Reads that passed the Illumina pipeline's quality filters were 
then aligned to the Berkeley Drosophila Genome Project's Release 
5 reference sequence [86] using Version 0.7.0 of the MAQ 
program [85]. Read pairs that mapped more than 1,000 bp apart 
and those for which the combined sum of the quality scores of 
mismatches exceeded 300 were filtered using the maq map -a 1000 
-e 300 command. Otherwise, default map parameters were used. 

147 bp Fragments in intronic and intergenic regions 

Mapped paired end clones of length 147 bp were then filtered 
based on Release 5.16 FlyBase Annotation of the D. melanogaster 
genome and classified as intronic or intergenic. For classifica- 
tion, all bases, including the flanking ±50 bp, were required to 
map entirely to a contiguous intronic or intergenic region. 
Heterochromatic reads were removed using cytogenomically- 
defined boundaries [87]. Downstream analysis was carried out 
on 276,614 intergenic and 270,998 intronic autosomal 147 bp 
nucleosomal fragments, referred to as intronic and intergenic 
nl47. These cover 61% and 79% respectively of the target 
intergenic and intronic regions in the euchromatic autosomes 
(chr2 and chr3). The coordinates of the nl47±50 bp flanking 
regions are in Tables SI and S2. Average nucleosomal read 
depths, where represented, are a pileup representation of a set of 
similarly processed paired end clones with lengths ranging from 
142-152 bp. 

In calculating dinucleotide frequencies, divergence, 7t and A n 
across nl47 regions (and, where relevant, ±50 bp flanking), both 
positional and average calculations took the dyad symmetry into 
account. Substitutional pathways were switched at the dyad axis, 
such that positions —73 to —1 were joined to the reverse 
complement of bases 1-73. Where included, flanking regions 
(±50 bp) were treated similarly. For larger scale positional 
divergence and genomic averages, data from complementary 
substitutional pathways were combined. 
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Dinucleotide frequencies 

The nl 47 dinucleotide frequencies are the averages (in the 
reference sequence) over all nl47 fragments for each position. 
Genomic over/ underrepresentations of dinucleotides were calcu- 
lated by dividing the difference between observed and expected 
frequencies by the expected frequency. Estimates of expected 
intergenic and intronic dinucleotide frequencies were calculated 
based on underlying base frequencies. Observed frequencies were 
computed direcdy from the reference sequence. 

Population genomic data 

The sequences of the euchromatic portions of 36 D. meltmogaster 
genomes from Raleigh, North Carolina were released by DPGP 
(http://www.dpgp.org/lK_50genomes.html - Reference_Release_ 
1.0). The sequencing, alignment and assignment of estimated quality 
scores are described in Langley. etal, 2012 [65]. The sequences of the 
22 D. meknogaster genomes from Rwanda, Africa were released by 
DPGP (http://www.dpgp.org/dpgp2/update_20Jan2012/dpgp2_ 
v2_rg.ID5.nohets.fastq.bz2). The sequencing, alignment and assign- 
ment of estimated quality scores are described in Pool et al., 2012 
[7 1] . For both data sets, only bases with a minimum quality score of 
Q30 or greater were included in the analyses. 

Divergence, frequency, n and A K 

Calculations of divergence on the meknogaster lineage, frequency, 
expected heterozygosity (it) and the index of skew in the frequency 
spectrum (A n ) were based solely on sites that could be polarized 
using a multiple alignment of D. meknogaster, D. simukns, D.yakuba 
and D. ereeta genomes [65], i.e., simukns and yakuba and/or erecta 
have the identical base. For consideration of the potential impact 
of ascertainment bias association with the isolation of nucleosomes, 
sites were subjected to a more stringent polarization (simukns, 
yakuba and erecta have the identical base) and calculations were 
done only on sites where the experimental genome had the 
ancestral allele. These statistics (divergence, n and A^) were 
estimated with two alternative weightings. The first, per-nl47, gave 
weight to genomic sites proportional to their occurrence at 
nucleosome positions among the nl47. Thus, in instances where a 
particular site was found multiple times in the nl47 set, the 
divergences or SNPs at that site were give proportionally more 
weight. This in effect weights the signal by nucleosome site 
"occupancy." The second, per-site, counted normalized the 
weighting such that each site (conserved, divergence or SNP) 
contributed equally, independent of its recurrence in the nl47. 
The per~nl47 estimates reflect those nucleosomal sequences that 
were readily isolated, while the per-site method treats each 
population genomic variant equally. 

A third representation of the mapping of divergence and A,i 
consists of random non-overlapping regions sampled from the 
nl47. These analyses are presented to simply address whether the 
periodicities arise solely from a ~ 1 0 bp periodicity in the overlap 
of nl47 fragments. For divergence, non-overlapping nl47 sets 
(72,710 intronic and 72,859 intergenic sequences) were generated 
by random sampling of intergenic and intronic nl47 without 
replacement. Newly drawn sequences were added to the non- 
overlapping subset only if they did not share any positions with 
prior sampled sequences. These non-overlapping subsets together 
cover 53% of the target intergenic and intronic regions in the 
euchromatic autosomes (chr2 and chr3) covered by the full nl47. 
For non-overlapping A n , intergenic and intronic nl47 (not 
including the flanking ±50 bp) were first filtered for only those 
nucleosomal regions containing the relevant SNP (taking 
dyad symmetry into account). This produced non-overlapping 
intergenic and intronic sets of ~90,000 regions each for G-h>GA 



and C^CT and ~55,000 regions for A— »AG. These sets were 
then subjected to random sampling without replacement. A new 
nl47 was added to the set if it did not overlap any already in the 
set. Non-overlapping G^GA and C^CT intronic and intergenic 
sets contained ~46,000 regions each and A-h>AG sets contained 
-30,000. 

All three of these methods for calculating divergence and A n 
yielded similarly periodic patterns reflecting the fact that while the 
genomic coverage of the nl47 is not deep, it is also relatively 
uniform (cv 0.65) and the periodicities are not arising from a small 
subset of the nl47 or interactions from overlapping nl47 
sequences. 

Definition of divergence. Divergence was based on the 
species reference sequences (see above). Polarized divergence 
estimates were calculated as the number of specific substitutions 
(e.g. number of G — >A) divided by the number of polarized sites 
inferred to have been the specific ancestral state (i.e., G in this 
case). For nl47 average divergence, 5 bp at each end of the 
fragments were trimmed from the analysis to minimize the 
influence of sequence bias at the enzyme cleavage site (Figure S4). 

Definition of frequency and it. All estimates of frequency 
and expected heterozygosity, n were calculated over sample sizes 
between n min and Vn as 



j "max 

frequency = — 



L/=i . 
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where L is the total number of sites (bp) with sample size between 



. and n„ 



; the frequency of an allele at the j of Si sites with 



sample size i. These SNPs can be categorized simply by state (A, 
C, G or T) or also as derived from an inferred ancestral state (e.g., A 
in a G^>GA SNP). For nl47 average n, nl47 regions were 
trimmed as described above for divergence. 

Definition of A n 

We require a sample-based index of the skew in the site 
frequency spectrum. Tajima [59] proffered the test statistic, D, a 
normalization of d, the difference between two estimates of the 
same population parameter, 4Mji, where ji is the mutation rate to 
selectively neutral alleles and N is the population size. These 
estimates are 



k = Lk = 



n-1 
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and 



M = 6L~- 



7=1 



in a sample of size n, where x, is the frequency of one of the two 
alleles at each of the S segregating sites in an arbitrary 
genomic segment of length L base pairs. Thus ft (expected 
heterozygosity) and 9 are per site (base pair) estimates of 
4Nn. 
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But here we seek not a test statistic for a genomic segment, but an 
index of the same deviation, that can be aggregated across 
heterogeneously sampled data and compared across classes of 
genomic annotation. 

To that end consider 



If the frequencies at different sites are independent then we can 
estimate the sample variance of A n (ri) for S sites with sampling 
depth n, simply as Var[A n (rij\ I S. 

Assuming again that these SNPs are sampled independendy 
both within and over sample sizes (i.e., no linkage disequilibrium), 
the average A„ can be estimated by the weighted average 



a simple rescaling of Tajima's (little) d, 



£ K{i 

"max 

£ 



Var[A n (i) 



Var[A n (i) 



d = SA 7[ . 

Tajima [59] also presents the distribution of the proportion of 
S segregating sites with frequency i/n in the sample, G„(i)/S. 
G„(i) can not be used to compute properties of a sample unless 
one can argue that the sites evolved independently and are 
sampled independently. If we choose a set of S segregating sites 
(assumed to be independently sampled from a population, i.e. 
no linkage disequilibrium), rather than a genomic segment, we 
have the expected heterozygosity in a sample of size n, 



E(k\n,s) = *£<?„( 
i=\ 

n-l 

*£ 



21^ — 
n \ n 



1 l 1 \ On-i) 
i n — ij n(n — 1) 



7=1 



and so our index, 



A„=^[k-E(k\n,s)] = 



£('*) 



This index is thus a measure of the deviation of the population 
(expected) heterozygosity per segregating site from its predicted 
value under the assumptions of equilibrium between selectively 
neutral mutation and genetic drift in a Wright-Fisher 
population. 

To estimate this deviation across sites with different sample 
sizes, we can calculate the weighted average, weighting by the 
reciprocal of the variance. 

The variance of A % is 



Var[A n {n)] = n ^— 



k\ 2S 



2i(n — i) 



n(n—l) 
"-' '\ 1 \ y\n-jf 

l) 2 



pi \J n-j) n 2( n - 



Notice that this is the theoretical variance in ft at a single 
segregating site in a sample of size n under the assumptions of the 
neutral model (above). 



where « min >3 and « milx = largest sample size. 

A n for each position in the nl47 regions (and its average 
across positions) was calculated over polarized sites with 
sampling sizes (n) between 32 and 34 in the Raleigh [65] and 
18 to 21 for Rwandan data [71]. For nl47 average A K the nl47 
regions were trimmed as described above for divergence. Data 
from complementary pathways were merged for larger scale 
positional A n . 

Correlations of divergence with GC frequency and 
nucleosome "occupancy" 

n 142- 152 coverage of intergenic nl47 regions was defined as 
the sum of the coverage of the region by the larger set of 1 42— 
152 bp nucleosomal fragments. This corresponds to what some 
authors call "occupancy." GC frequency of intergenic nl47 
regions was calculated based on the nucleotide frequencies in the 
D. melanogaster reference sequence from bases 6 to 141 of the n 1 47 
regions (to minimize the potential impact of MNase sequence 
bias). For density plots and Spearman's p, only nl47 regions with 
at least 50 (out of 147) intergenic bases polarizable were included 
in the analyses. As above, the interior 5 bp on each end of n 147 
regions were trimmed from the analysis. For correlations between 
divergence and genomic intergenic GC frequencies, Spearman's p 
were reported for non-overlapping 500 bp windows in which at 
least 166 intergenic bases were polarizable and no more than 
250 bp in the reference were "N". 

Plot details 

Plots were generated using R [88]. Prior to plotting, all 
calculations were symmetrized around the dyad axis. nl47 
(±50 bp) divergence and A n plots (Figure ID, Figure 2B,C, 
Figure 3A) were smoothed using running average in a window of 
5 bp (weights: 0.125, 0.250, 0.250, 0.250, 0.125). For large scale 
plots of dinucleotide frequency, divergence and (Figure 1C, 
Figure 2D, Figure 3B), flanking regions were smoothed using 
running average smoothing with a window of 50 bp of equal 
weights. In those plots, the central nl47 regions were smoothed 
separately using a 30 bp window of equal weights. Regions of 5 bp 
upstream and downstream of the nl47 edges were trimmed prior 
to smoothing for large-scale plots. 

Structural mapping of divergence rates 

To elucidate the distribution of the smoothed (as above) 
divergence on DNA from the structure of the nucleosome we 
colored-coded values of each base pair in a schematic rendering of 
the DNA strands of pdblkx5 [55] using PyMOL [89]. Only the 
"top" turn of the DNA (base pairs 73 to —6) is shown. Bases 73 to 
70 were rendered as grey, due to extreme values induced by 
MNase sequence bias. 
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Supporting Information 

Figure SI Paired end sequencing of MNase released mono- 
nucleosomal fragments produces a range of fragment sizes. (A) 
Electrophoretic separation of DNA isolated from digestions of 
embryonic chromatin with three dilutions of MNase (at top). (B) 
Distributions of sizes of sequenced nucleosomal fragments. The 
proportion of paired-end sequenced fragments of each size is 
plotted against the size (bp). The sizes of 8,124,351 fragments that 
mapped to the 96.6 Mbp of euchromatic arms from a single 
GA2x lane of the initial library prepared from DNA isolated from 
the 1 x digestion in panel A is shown in black. Almost completely 
superimposed are the distributions of sizes of fragments from four 
GA2x lanes of a size-selected fraction of that initial library 
(numbers: 7,892,551, 7,947,142, 7,871,834 and 7,825,097; red, 
blue, green and magenta respectively). Also shown in light green 
is the position of the nl47 fraction used in the most of the 
analyses. 
(EPS) 

Figure S2 Dinucleotide frequencies across intergenic (black) and 
intronic (red) nl47 regions display a ~10bp periodicity. 
Dinucleotide frequencies are represented for bases 0 to —73 
across nl47 and —50 bp of flanking sequence. 
(EPS) 

Figure S3 Nucleotide frequencies surrounding intergenic and 
intronic nl47 regions show a nucleosome-scale periodicity. Single 
nucleotide frequencies surrounding nl47 regions (±600 bp). 
(EPS) 

Figure S4 Per-site divergence on the melanogaster lineage for the 
substitutional pathways displays ~10bp periodicity, consistent 
with per-nl47 results. Smoothed average per-site polarized diver- 
gence across intergenic and intronic nl47 regions (±50 bp) for 
specific substitutions is shown. Indicated dinucleotide frequencies 
are plotted above for reference. 
(EPS) 

Figure S5 Polarized divergence on the melanogaster lineage is 
periodic across nl47 regions. (A) Smoothed average polarized 
divergence for specific substitutions across regions defined by non- 
overlapping subsets of intergenic and intronic nl47 (±50 bp). 
Intergenic dinucleotide frequencies are represented above for 
reference. (B) Smoothed intergenic A— »G divergence mapped on 
to bases +73 to —6 from the nucleosome structure [55]. Arginines 
that contact the minor groove are color coded by histone. (C) 
Smoothed average combined A— »G:T-^C polarized divergence 
surrounding intergenic (black) and intronic (red) nl47 regions 
(± 1 kb). Read depth of intronic (light grey) and intergenic (grey) 
n 142- 152 is represented below. 
(EPS) 

Figure S6 MNase cleavage sites show consistent sequence bias at 
the boundaries of intergenic and intronic nl47 fragments. Average 
5'-3'nucleotide frequencies surrounding the MNase cleavage sites 
for intergenic and intronic nl47 fragments (±30 bp). 
(EPS) 

Figure S7 Divergence of six substitutional pathways along the 
melanogaster lineage correlate with GC content and nucleosome 
"occupancy". A) Plots of the rate of divergence for intergenic 
nl47 sequences for six substitutional pathways versus the log 10 
of coverage by nl42-152 ("occupancy"). Correlation coeffi- 
cients are reported in Table S3. B) Plots of the rate of 
divergence for intergenic nl47 sequences for six substitutional 
pathways versus their fraction GC. Correlation coefficients are 
reported in Table S3. Red lines show the linear regression fits, 



while the green represent the least-square fits to a third degree 

polynomial. 

(EPS) 

Figure S8 G^A:C^>T divergence on the melanogaster lineage 
and AA/TT frequency varies with nl47 GC content and 
nucleosome "occupancy." (A) Average frequency of AA/TT 
dinucleotides (green) and rate of G— »A:C— »T divergence 
(orange) for intergenic nl47 (and 1 kb flanking) regions binned 
by nucleosome "occupancy", i.e., coverage of the nl47 by the 
larger set of nl42-152 nucleosomal fragments. nl47 regions 
were ranked by coverage and separated into 5 bins (top is 
lowest and bottom is highest). Average divergence and AA/TT 
frequencies are shown for each bin. Intergenic n 142- 152 
coverage is represented in grey (scale on blue axis). Although 
the axis limits for coverage vary, the scale of these axes is 
consistent. (B) Average G^>A:C^T divergence across inter- 
genic nl47 regions for 5 bins based on n 142- 152 coverage (Q_l 
- lowest, Q5 - highest). (C) Average G^A:C^T divergence 
across intergenic nl47 regions for 5 bins based on ranked nl47 
GC content (Q.1 - lowest, Q5 - highest). (D) Scatter plot of 
intergenic nl47 GC frequency against logi 0 nl42-152 cover- 
age. Spearman's p is reported. 
(EPS) 

Figure S9 Per-nl47 SNP frequencies display a ~ 10 bp periodic 
variation across intergenic and intronic nl47 regions. Smoothed 
average per-nl4 7 frequency of A (A) or T (B) alleles for G-A or C-T 
polymorphic sites respectively in the Raleigh sample (purple). 
Frequencies after removal of singleton classes are plotted in blue. 
(EPS) 

Figure S10 Polarized per-nl47 SNP frequencies across intergenic 
nl47 regions show evidence of ~ 10 bp periodicities aligning with 
dinucleotide periodicities. (A) Average smoothed (black) and 
unsmoothed (grey) per-nl47 frequencies of derived alleles for 
ancestrally G to G-A polymorphic, ancestrally C to C-T 
polymorphic and ancestrally A to A-G polymorphic sites in 
intergenic regions. (B) The same as A but with singleton classes 
(where only one allele of the derived or ancestral base was 
observed) removed. 
(EPS) 

Figure Sll Polarized per-nl47 SNP frequencies across intronic 
nl47 regions show evidence of ~10 bp periodicities aligning with 
dinucleotide periodicities. (A) Average smoothed (black) and 
unsmoothed (grey) per-nl47 frequencies of derived alleles for 
ancestrally G to G-A polymorphic, ancestrally C to C-T 
polymorphic and ancestrally A to A-G polymorphic sites in 
introns. (B) The same as A but with singleton classes (where only 
one allele of the derived or ancestral base was observed) removed. 
(EPS) 

Figure SI 2 The weighting of estimates of A n has little impact on 
the ~10 bp periodic patterns for per-site A n and per-nl47 A n . 
Average smoothed (black) and unsmoothed (grey) per-site A K across 
nl47 regions (±50 bp) for G^GA, C^CT, and A^AG 
polymorphic sites. Color-coded dinucleotide frequencies are 
included for reference. 
(EPS) 

Figure S13 The patterns of per-nl47 A n across five indepen- 
dently sampled non-overlapping subsets of intergenic and intronic 
nl47 sequences retain much of the ~10 bp periodic signal. 
Average smoothed (black) and unsmoothed (grey) A n (Raleigh 
data) for G^GA, C^CT, and A— >AG polymorphic sites across 
regions defined by non-overlapping subsets (see Materials and 
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Methods) of intronic and intergenic nl47 (±50 bp). Frequencies of 

the indicated dinucleotides are plotted for reference. 

(EPS) 

Figure S14 Per-nl47 A K across intergenic and intronic nl47 for 
Rwandan lines shows evidence of nucleosome-associated SNP 
periodicities in an independent sample. Average smoothed (black) 
and unsmoothed (grey) per-nl47 A n for Rwandan lines for G^>GA, 
C^CT, and A^AG polymorphic sites across regions defined by 
intronic and intergenic nl47 (±50 bp). Frequencies of the 
indicated dinucleotides are plotted for reference. 
(EPS) 

Figure S15 Ascertainment bias resulting from SNPs in the 
experimental strain does not explain the ~ 10 bp periodic patterns 
of Au. Average smoothed (black) and unsmoothed (grey) intronic 
per-nl47 A n across nl47 regions (±50 bp) for G^GA, C^CT, 
and A-^-AG polymorphic sites in the Raleigh sample where the 
experimental strain had the ancestral allele. 
(EPS) 

Figure S16 A^G:T-h>C A^ in the regions flanking nl47 
sequences. Average combined A-^>G:T^C divergence (green) 
and A— »AG:T— »TC A n (black) surrounding intergenic and 
intronic nl47 (±1 kb). Read depth of intronic or intergenic 
n 142- 152 is represented in grey below (axis label in blue). 
(EPS) 

Figure S17 Genomic average values of divergence, Jt and A n 
follow trends seen for nl47 regions. Average scaled genome-wide 
n, A n and divergence for specific substitutions, partitioned by 
intergenic versus intronic (see Table S4). 
(EPS) 
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